In this work, we tackle two vital tasks in automated driving systems: driver intent prediction and risk object identification from egocentric images. Specifically, we investigate the question: what would be good road scene-level representations for these two tasks? We contend that a scene-level representation must capture higher-level semantic and geometric representations of the traffic scene around the ego-vehicle as it performs actions toward its destination. To this end, we introduce the representation of semantic regions, which are areas that the ego-vehicle visits while taking an afforded action (e.g., a left turn at a 4-way intersection). We propose to learn scene-level representations via a novel semantic region prediction task and an automatic semantic region labeling algorithm. Extensive evaluations are conducted on the HDD and nuScenes datasets, and the learned representations lead to state-of-the-art performance for driver intention prediction and risk object identification.
Spatial AI that can perform complex tasks through visual signals and cooperate with humans is highly anticipated. To achieve this, we need a visual SLAM that easily adapts to new scenes without pre-training and generates dense maps for downstream tasks in real time. None of the previous learning-based and non-learning-based visual SLAMs satisfy all of these needs due to the intrinsic limitations of their components. In this work, we develop a visual SLAM named Orbeez-SLAM, which successfully combines implicit neural representations (NeRF) with visual odometry to achieve our goals. Moreover, Orbeez-SLAM can work with a monocular camera since it only requires RGB inputs, making it widely applicable to the real world. We validate its effectiveness on various challenging benchmarks. Results show that our SLAM is up to 800x faster than the strong baseline, with superior rendering outcomes.
The deep image prior (DIP) is a recently proposed technique for solving imaging inverse problems by fitting the reconstructed image to the output of an untrained convolutional neural network. Unlike pretrained feedforward neural networks, the same DIP can generalize to arbitrary inverse problems, from denoising to phase retrieval, while offering competitive performance on each task. The main disadvantage of DIP is that, while feedforward neural networks can reconstruct an image in a single pass, DIP must gradually update its weights over hundreds to thousands of iterations, at significant computational cost. In this work, we use meta-learning to massively accelerate DIP-based reconstructions. By learning a proper initialization for the DIP weights, we demonstrate a 10x improvement in runtimes across a range of inverse imaging tasks. Moreover, we demonstrate that a network trained to quickly reconstruct faces also generalizes to reconstructing natural image patches.
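To make the idea concrete, below is a minimal sketch of meta-learning an initialization for DIP weights. It uses a Reptile-style first-order update rather than the paper's exact meta-learning algorithm, and the tiny network, noise input, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of meta-learning a DIP initialization (assumptions noted above).
import copy
import torch
import torch.nn as nn

def make_dip_net():
    # A tiny stand-in for the usual DIP encoder-decoder.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1),
    )

def dip_fit(net, z, y_degraded, steps, lr):
    """Inner loop: fit the net so that net(z) matches the degraded observation."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        loss = ((net(z) - y_degraded) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

meta_net = make_dip_net()
for it in range(100):                      # meta-training iterations
    y = torch.rand(1, 3, 64, 64)           # stand-in for a sampled training image
    z = torch.randn(1, 3, 64, 64)          # fixed noise input, as in DIP
    task_net = copy.deepcopy(meta_net)     # adapt a copy to this task
    dip_fit(task_net, z, y, steps=20, lr=1e-3)
    # Reptile-style meta-update: move the initialization toward adapted weights.
    with torch.no_grad():
        for p_meta, p_task in zip(meta_net.parameters(), task_net.parameters()):
            p_meta += 0.1 * (p_task - p_meta)
# At test time, dip_fit starts from meta_net's weights, so far fewer
# iterations are needed than from a random initialization.
```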
In few-shot imitation learning (FSIL), using behavioral cloning (BC) to solve unseen tasks with few expert demonstrations has become a popular research direction. The following capabilities are essential in robot applications: (1) behaving in compound tasks that contain multiple stages; (2) retrieving knowledge from few length-variant and misaligned demonstrations; (3) learning from different experts. No previous work can achieve these abilities at the same time. In this work, we tackle the FSIL problem under the union of the above settings and introduce a novel stage conscious attention network (SCAN) to simultaneously retrieve knowledge from few demonstrations. SCAN uses an attention module to identify each stage in length-variant demonstrations. Moreover, it is designed under a demonstration-conditioned policy that learns the relationship between experts and agents. Experimental results show that SCAN can learn from different experts without fine-tuning and outperforms the baselines on complicated compound tasks, with explainable visualizations.
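As a rough illustration of the stage-aware attention described above, the sketch below lets the agent's current embedding attend over the frames of a single demonstration; the dimensions and the plain dot-product scoring are assumptions on our part, not SCAN's actual architecture.

```python
# A hedged sketch of attention over demonstration frames (not SCAN's design).
import torch
import torch.nn.functional as F

def stage_attention(agent_state, demo_frames):
    """agent_state: (d,) current embedding; demo_frames: (T, d) one demonstration."""
    scores = demo_frames @ agent_state / agent_state.size(0) ** 0.5
    weights = F.softmax(scores, dim=0)     # soft assignment over demo stages
    return weights @ demo_frames           # stage-conditioned context vector

context = stage_attention(torch.randn(64), torch.randn(30, 64))
# Because attention is computed over frames, demonstrations of different
# lengths are handled uniformly, and the weights can be visualized to show
# which stage the agent currently attends to.
```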
Object detection with multimodal inputs can improve many safety-critical systems such as autonomous vehicles (AVs). Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter provides much stronger object signatures under poor illumination. We explore strategies for fusing information from different modalities. Our key contribution is a probabilistic ensembling technique, ProbEn, a simple non-learned method that fuses together detections from multiple modalities. We derive ProbEn from Bayes' rule and first principles that assume conditional independence across modalities. Through probabilistic marginalization, ProbEn elegantly handles missing modalities when detectors do not fire on the same object. Importantly, ProbEn notably improves multimodal detection even when the conditional independence assumption does not hold, e.g., when fusing outputs from other fusion methods (both off-the-shelf and trained in-house). We validate ProbEn on two benchmarks containing aligned (KAIST) and unaligned (FLIR) multimodal images, showing that ProbEn outperforms prior work by more than 13% in relative performance!
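The fusion rule itself is compact enough to sketch. Under conditional independence, Bayes' rule gives p(y | x_rgb, x_thermal) ∝ p(y | x_rgb) p(y | x_thermal) / p(y). The snippet below applies this rule to class posteriors; the toy numbers and the uniform-prior default are our assumptions, and the paper's fusion of box coordinates is omitted.

```python
# A minimal sketch of probabilistic ensembling of class posteriors.
import numpy as np

def proben_fuse(posteriors, prior=None):
    """Fuse per-modality class posteriors (each a length-K array).
    A missing modality is simply omitted from the list, which corresponds
    to marginalizing it out."""
    log_p = sum(np.log(p + 1e-12) for p in posteriors)
    if prior is not None:
        # Divide out the class prior once per extra modality.
        log_p -= (len(posteriors) - 1) * np.log(prior + 1e-12)
    p = np.exp(log_p - log_p.max())        # stabilize before normalizing
    return p / p.sum()

rgb     = np.array([0.6, 0.3, 0.1])        # e.g. posterior from an RGB detector
thermal = np.array([0.7, 0.2, 0.1])        # posterior from a thermal detector
print(proben_fuse([rgb, thermal]))         # fused posterior, sharper than either
print(proben_fuse([thermal]))              # RGB missing: falls back to thermal
```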
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
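A minimal sketch of the dual-encoder contrastive objective is shown below; the linear towers, temperature, and toy batch are placeholders standing in for the paper's large image and text encoders.

```python
# A hedged sketch of a symmetric image-text contrastive loss (InfoNCE-style).
import torch
import torch.nn.functional as F

image_encoder = torch.nn.Linear(2048, 256)   # stand-in for the image tower
text_encoder  = torch.nn.Linear(768, 256)    # stand-in for the text tower
temperature   = 0.05                         # illustrative value

def contrastive_loss(image_feats, text_feats):
    # L2-normalize, then score every image against every text in the batch.
    img = F.normalize(image_encoder(image_feats), dim=-1)
    txt = F.normalize(text_encoder(text_feats), dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(logits))       # matched pairs lie on the diagonal
    # Symmetric loss: image-to-text plus text-to-image.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = contrastive_loss(torch.randn(32, 2048), torch.randn(32, 768))
loss.backward()
```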
Learning to detect objects, such as humans, in imagery captured by an unmanned aerial vehicle (UAV) usually suffers from huge variations caused by the UAV's position relative to the objects. In addition, existing UAV-based benchmark datasets do not provide adequate dataset metadata, which is essential for precise model diagnosis and for learning features invariant to those variations. In this paper, we introduce Archangel, the first UAV-based object detection dataset composed of real and synthetic subsets captured under similar imaging conditions, along with UAV position and object pose metadata. A series of experiments is carefully designed with a state-of-the-art object detector to demonstrate the benefits of leveraging the metadata during model evaluation. Moreover, several crucial insights involving both real and synthetic data during model fine-tuning are presented. Finally, we discuss the advantages, limitations, and future directions of Archangel to highlight its distinct value for the broader machine learning community.
Dialogue state tracking models play an important role in task-oriented dialogue systems. However, most of them model the slot types conditionally independently given the input. We discover that this may cause the model to be confused by slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they are able to alleviate the confusion mentioned above and push the state of the art from 58.7 to 61.3. Our implementation is available at https://github.com/ctinray/trippy-joint.
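One simple way to model slots jointly, in the spirit of the LSTM variant above, is to run a recurrent layer across per-slot representations before classification; the sketch below does exactly that, with all sizes and modules chosen for illustration rather than taken from the paper.

```python
# A hedged sketch of joint slot modeling via an LSTM over slot representations.
import torch
import torch.nn as nn

num_slots, slot_dim, num_values = 30, 128, 50
slot_reprs = torch.randn(1, num_slots, slot_dim)  # per-slot encodings (batch=1)
lstm = nn.LSTM(slot_dim, slot_dim, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * slot_dim, num_values)

joint, _ = lstm(slot_reprs)        # slots exchange information along the sequence
value_logits = classifier(joint)   # (1, num_slots, num_values) joint predictions
# Slots that share a data type can now disambiguate each other instead of
# being predicted in isolation.
```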
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
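A schematic sketch of generative replay for this setting is given below: replayed samples from a generator trained on earlier domains are mixed with unlabeled data from the new domain, and distillation from a frozen snapshot of the previous model preserves old knowledge. The stand-in modules, the entropy term, and the distillation loss are our assumptions, not GarDA's exact objectives.

```python
# A hedged sketch of generative replay for continual unsupervised adaptation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

segmenter = nn.Conv2d(1, 2, 3, padding=1)        # stand-in segmentation network
generator = nn.ConvTranspose2d(8, 1, 4, 2, 1)    # stand-in appearance generator
old_model = copy.deepcopy(segmenter).eval()      # frozen snapshot from domain t-1
opt = torch.optim.Adam(segmenter.parameters(), lr=1e-4)

for x_new in [torch.randn(4, 1, 32, 32) for _ in range(10)]:  # unlabeled domain t
    x_replay = generator(torch.randn(4, 8, 16, 16)).detach()  # replayed "old" data
    with torch.no_grad():
        soft_targets = old_model(x_replay).softmax(dim=1)     # old knowledge
    # Entropy minimization is a common unsupervised stand-in for the
    # new-domain loss; distillation on replayed samples prevents forgetting.
    log_p = segmenter(x_new).log_softmax(dim=1)
    loss_new = -(log_p.exp() * log_p).sum(dim=1).mean()
    loss_old = F.kl_div(segmenter(x_replay).log_softmax(dim=1),
                        soft_targets, reduction="batchmean")
    loss = loss_new + loss_old
    opt.zero_grad(); loss.backward(); opt.step()
```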
The development of social media user stance detection and bot detection methods relies heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, hindering graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB is built on the largest original dataset in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extract the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we perform a thorough evaluation of MGTAB and other public datasets. Our experiments find that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
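To illustrate the kind of multi-relational graph model the benchmark targets, here is a minimal R-GCN-style layer that aggregates neighbors separately per relation type; the toy graph and feature sizes are placeholders, not part of MGTAB.

```python
# A hedged sketch of a multi-relational graph convolution layer.
import torch
import torch.nn as nn

class MultiRelationalLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.rel_lins = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_relations)])
        self.self_lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adjs):
        # adjs[r] is the (normalized) adjacency matrix of relation r.
        out = self.self_lin(x)
        for lin, adj in zip(self.rel_lins, adjs):
            out = out + adj @ lin(x)        # aggregate neighbors per relation
        return out.relu()

# Toy usage: MGTAB's real graph has 10,199 labeled users and 7 relations;
# here we use 5 nodes, 2 relations, and 20 user property features.
x = torch.randn(5, 20)
adjs = [torch.eye(5) for _ in range(2)]     # placeholder adjacency matrices
h = MultiRelationalLayer(20, 16, num_relations=2)(x, adjs)
```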